Asymptotics of estimators

Author

Stefan Heyder

Published

July 6, 2024

In my PhD thesis, I compare two methods to perform optimal importance sampling: the Cross-Entropy method (CE) and Efficient Importance Sampling (EIS). There are several criteria that you may want to consider for this comparison, but without going into too much detail, in this post I’ll focus on just one aspect: asymptotics, in particular asymptotic relative efficiencies.

If you’re interested, check out the draft of my thesis on GitHub.

In the context of my thesis, this crops up because both methods are simulation-based. Usually, we have to resort to simulation techniques when analytical computation is too difficult, which is exactly the case here! Both methods solve an optimization problem in the background, solving for an optimal parameter \(\psi\). Unfortunately, the optimization problems involve quantities that we have no analytical access to. We can, however, set up a simulation routine that provides an estimate \(\hat\psi\) of this optimal parameter.

But using simulations means that the output of both methods is now random, i.e. we can expect to get different results \(\hat\psi\) when we repeatedly apply the methods using different RNG seeds. This can be problematic: if there is too much variation, the actual \(\hat\psi\) we obtain could be far away from the optimal \(\psi\), and the performance of the methods suffers.

This is where asymptotics come into play. When statisticians talk about asymptotics for estimators, here \(\hat\psi\), they are usually concerned with two things: consistency and asymptotic normality. Consistency is a critical property of an estimator.

Consistency

Definition (consistency)

Let \((\hat\psi_{N})_{N \in \mathbf N}\) be a sequence of estimators of \(\psi\). We say that \(\hat\psi_{N}\) is weakly consistent if \[ \hat\psi_{N} \stackrel{\mathbb P}{\to} \psi \] and strongly consistent if \[ \hat \psi_{N} \stackrel{\text{a.s.}}{\to} \psi. \]

In the following I’ll be a little bit sloppy and identify \(\hat\psi_{N}\) with the sequence of estimators \((\hat\psi_{N})_{N \in \mathbf N}\).

Consistency captures the intuitive property that, as we increase the number of samples \(N\) used in our simulation routine, the noisy estimate \(\hat\psi_N\) should get closer and closer to the true \(\psi\). Without consistency, our simulation routine is essentially useless.
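To make this concrete, here is a minimal simulation sketch (Python; the toy setup, the mean of an Exponential(1) distribution estimated by the sample mean, is my own choice for illustration). By the strong law of large numbers the sample mean is strongly consistent, so the estimates should settle down around the true value as \(N\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
psi = 1.0  # true parameter: the mean of an Exponential(1) distribution

# the sample mean is a strongly consistent estimator of psi
for N in [10, 100, 10_000, 1_000_000]:
    psi_hat = rng.exponential(scale=1.0, size=N).mean()
    print(f"N = {N:>9}: psi_hat = {psi_hat:.4f}")
```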

But consistency alone does not tell us how large we should choose our sample size \(N\). For that, central limit theorems can help.

Asymptotic normality

Definition (central limit theorem)

Again, let \((\hat\psi_{N})_{N \in \mathbf N}\) be a sequence of estimators of \(\psi\). If \[ \sqrt {N} \left( \hat \psi_N - \psi \right) \stackrel{\mathcal D}{\to} \mathcal N(0, \Sigma), \] we say that \(\hat\psi_{N}\) fulfills a central limit theorem with asymptotic covariance matrix \(\Sigma\).

If we have a consistent estimator, notice that \(\hat\psi_N - \psi\) goes to \(0\) as \(N\) goes to \(\infty\) (in probability or almost surely, depending on the type of consistency). Multiplying by \(\sqrt{N}\) “blows up” this error, essentially zooming in to what happens around the true value \(\psi\). Note that, conversely, any estimator that fulfills a central limit theorem is weakly consistent: by Slutsky’s lemma, \(\hat\psi_N - \psi = \frac{1}{\sqrt N} \cdot \sqrt N \left( \hat\psi_N - \psi \right) \stackrel{\mathcal D}{\to} 0 \cdot Z = 0\) where \(Z \sim \mathcal N(0, \Sigma)\), and convergence in distribution to a constant implies convergence in probability.
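To see this “zooming in” at work, here is a small sketch (Python; again the Exponential(1) toy example, my own choice): while the raw error \(\hat\psi_N - \psi\) shrinks as \(N\) grows, the rescaled error \(\sqrt N (\hat\psi_N - \psi)\) keeps roughly the same spread, the asymptotic standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)
psi, reps = 1.0, 2_000  # Exponential(1) has mean 1 and variance 1

for N in [100, 1_000, 10_000]:
    # reps independent realizations of the estimator psi_hat_N
    psi_hat = rng.exponential(scale=1.0, size=(reps, N)).mean(axis=1)
    err = psi_hat - psi
    print(f"N = {N:>6}: sd(err) = {err.std():.4f}, "
          f"sd(sqrt(N) * err) = {(np.sqrt(N) * err).std():.3f}")
```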

Why should we assume a normal distribution as the limiting distribution? At first this choice may seem surprising, maybe even restrictive. But it turns out that for large classes of estimators, e.g. M- and Z-estimators, we can prove such a central limit theorem.

How does having a central limit theorem help us? If we know \(\Sigma\), we can use it to get a heuristic for how large we should choose \(N\). Using the approximation \[ \hat\psi_{N} -\psi \approx \mathcal N\left(0, \frac{\Sigma}{N}\right) \] we can choose \(N\) such that, for a given error tolerance \(\varepsilon > 0\), the probability that \(\lVert \hat \psi_N - \psi \rVert < \varepsilon\) becomes, say, at least \(80\%\).
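In the one-dimensional case this heuristic is explicit: \(\mathbb P\left(|\hat\psi_N - \psi| < \varepsilon\right) \approx 2\Phi\left(\varepsilon \sqrt N / \sigma\right) - 1 \geq 0.8\) solves to \(N \geq \left( \sigma z_{0.9} / \varepsilon \right)^{2}\), where \(\sigma^2 = \Sigma\) and \(z_{0.9}\) is the \(90\%\) quantile of the standard normal distribution. A minimal sketch (Python; the function name and example values are made up for illustration):

```python
import math
from scipy.stats import norm

def required_N(sigma: float, eps: float, prob: float = 0.8) -> int:
    """Smallest N with P(|psi_hat - psi| < eps) >= prob, under the
    approximation psi_hat - psi ~ N(0, sigma**2 / N)."""
    z = norm.ppf((1 + prob) / 2)  # two-sided standard normal quantile
    return math.ceil((sigma * z / eps) ** 2)

# e.g. asymptotic standard deviation 2, tolerance 0.1: N = 657
print(required_N(sigma=2.0, eps=0.1))
```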

Additionally, if we have two estimators \(\hat \psi^1\) and \(\hat\psi^{2}\), both of which fulfill a central limit theorem with asymptotic covariance matrices \(\Sigma_1\) and \(\Sigma_{2}\), we can compare \(\Sigma_1\) and \(\Sigma_2\). The estimator with the “smaller” asymptotic covariance matrix is the one we should prefer.

As we are dealing with matrices, it is not immediately clear what “smaller” means; we could be interested in, e.g.,

  • \(\Sigma_1 \succ \Sigma_{2}\), i.e. \(\Sigma_1 - \Sigma_2\) is positive definite, or
  • \(\operatorname{trace} \left( \Sigma_{1} \right) > \operatorname{trace} \left( \Sigma_{2} \right)\), i.e. the asymptotic mean squared error (MSE) is smaller for \(\hat\psi^{2}\) (see the sketch below).
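Both criteria are easy to check numerically. A small sketch (Python; the two matrices are made up for illustration), using the fact that a symmetric matrix is positive definite if and only if all of its eigenvalues are positive:

```python
import numpy as np

# hypothetical asymptotic covariance matrices of two estimators
Sigma1 = np.array([[2.0, 0.5], [0.5, 1.5]])
Sigma2 = np.array([[1.0, 0.2], [0.2, 0.8]])

diff = Sigma1 - Sigma2
# Loewner order: Sigma1 - Sigma2 positive definite <=> all eigenvalues > 0
print("Sigma1 > Sigma2 (Loewner):", bool(np.all(np.linalg.eigvalsh(diff) > 0)))
# trace criterion: compare asymptotic MSEs
print("traces:", np.trace(Sigma1), np.trace(Sigma2))
```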

Usually the asymptotic covariance matrix \(\Sigma\) is not known, typically because it depends on the true parameter \(\psi\) or other unknown quantities. For a toy example, consider the normal distribution \(\mathcal N(\mu, \sigma^2)\) with mean \(\mu \in \mathbf R\) and variance \(\sigma^2 \in \mathbf R_{> 0}\). Then the sample mean \(\bar X_{N}\) of samples \(X_{1}, \dots, X_{N} \stackrel{\text{i.i.d.}}{\sim} \mathcal N(\mu, \sigma^{2})\), as an estimator of \(\mu\), fulfills a central limit theorem: \[ \sqrt{N} (\bar X_{N} - \mu) \stackrel{\mathcal D}\to \mathcal N(0, \sigma^{2}). \] Here \(\sigma^2\) may be unknown. In practice we can estimate it consistently by the sample variance \(\hat \sigma_N^{2}\), and so \[ \frac{\sqrt{N}}{\sqrt{\hat \sigma_{N}^2}} \left( \bar X_{N} - \mu \right) \stackrel{\mathcal D}\to \mathcal N(0,1) \] by Slutsky’s lemma. Thus, for large \(N\), we can replace the true asymptotic variance \(\sigma^2\) by its consistent estimate \(\hat\sigma_N^2\) and proceed as above.
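A minimal sketch of this plug-in step (Python; the values \(\mu = 1\) and \(\sigma = 2\) are arbitrary): across many replications, the studentized error \(\sqrt N (\bar X_N - \mu) / \hat\sigma_N\) behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, reps = 1.0, 2.0, 1_000, 10_000

X = rng.normal(mu, sigma, size=(reps, N))
# studentize: replace the unknown sigma by the sample standard deviation
t = np.sqrt(N) * (X.mean(axis=1) - mu) / X.std(axis=1, ddof=1)
print(f"mean = {t.mean():.3f}, sd = {t.std():.3f}")  # roughly 0 and 1
```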

What’s next?

In this post, I’ve tried to convey the usefulness of asymptotics, especially when it comes to comparing two estimators of \(\psi\). However, I have not given you any guidance on when an estimator is consistent and when it fulfills a central limit theorem. It turns out that for many types of estimators we can obtain central limit theorems under some assumptions. I will discuss these settings in a following post.